Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting
Authors
Abstract
Personalized voice triggering is a key technology in voice assistants and serves as the first step for users to activate the voice assistant. It involves keyword spotting (KWS) and speaker verification (SV). Conventional approaches to this task develop the KWS and SV systems separately. This paper proposes a single system, called the multi-task deep cross-attention network (MTCANet), that performs KWS and SV simultaneously while effectively utilizing information relevant to both tasks. The proposed framework integrates a KWS sub-network and an SV sub-network to enhance performance in challenging conditions such as noisy environments, short-duration speech, and model generalization. At the core of MTCANet are three modules: a novel deep cross-attention (DCA) module to integrate the KWS and SV tasks, a multi-layer stacked shared encoder (SE) to reduce the impact of noise on the recognition rate, and soft attention (SA) modules that allow the model to focus on pertinent information in the middle layer while preventing gradient vanishing. Our proposed model demonstrates outstanding performance on the well-off test set, improving by 0.2%, 0.023, and 2.28% over the well-known SV model emphasized channel attention, propagation, and aggregation in time delay neural network (ECAPA-TDNN) and the advanced KWS model Convmixer in terms of equal error rate (EER), minimum detection cost function (minDCF), and accuracy (Acc), respectively.
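The abstract reports speaker-verification gains in terms of equal error rate (EER), the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how EER can be estimated from trial scores, not the authors' evaluation code. The function name `equal_error_rate`, the threshold sweep over observed scores, and the example scores are all assumptions made for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate (eer, threshold) by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= threshold. The EER is taken at
    the threshold where false acceptance and false rejection rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer, best_thr = None, None, None
    for thr in thresholds:
        # False rejection rate: genuine-speaker trials scored below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance rate: impostor trials scored at or above the threshold.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


# Hypothetical similarity scores, made up for illustration only.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer, thr = equal_error_rate(targets, nontargets)
print(f"EER = {eer:.2%} at threshold {thr}")
```

With only a handful of trials the two error rates may never cross exactly, which is why the sketch averages them at the closest point; evaluation toolkits typically interpolate along the full error curve instead.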
Similar resources
Multi-Task Learning and Weighted Cross-Entropy for DNN-Based Keyword Spotting
We propose improved Deep Neural Network (DNN) training loss functions for more accurate single keyword spotting on resource-constrained embedded devices. The loss function modifications consist of a combination of multi-task training and weighted cross entropy. In the multi-task architecture, the keyword DNN acoustic model is trained with two tasks in parallel: the main task of predicting the ke...
Transferable Deep Features for Keyword Spotting
Deep features, defined as the activations of hidden layers of a neural network, have given promising results applied to various vision tasks. In this paper, we explore the usefulness and transferability of deep features, applied in the context of the problem of keyword spotting (KWS). We use a state-of-the-art deep convolutional network to extract deep features. The optimal parameters concerning...
Confidence Measure for Utterance Verification in Keyword Spotting System
In this article, we propose an utterance verification technique for keyword spotting. The keyword spotting system analyzes a given spoken content and searches every speech segment in which one of pre-defined keywords is uttered. To maintain a stable recognition performance in the system, we propose an utterance verification technique that verifies whether a found utterance, or a candidate keywo...
Multi-task learning for text-dependent speaker verification
Text-dependent speaker verification uses short utterances and verifies both speaker identity and text contents. Due to this nature, traditional state-of-the-art speaker verification approaches, such as i-vector, may not work well. Recently, there has been interest in applying deep learning to speaker verification; however, in previous works, standalone deep learning systems have not achieved sta...
Spoken keyword spotting via multi-lattice alignment
We propose a method for finding keywords in an audio database using a spoken query. Our method is based on performing a joint alignment between a phone lattice generated from a spoken utterance query and a second phone lattice representing a long utterance needing to be searched. We implement this joint alignment procedure in a graphical models framework. We evaluate our system on TIMIT as well...
Journal
Journal title: EURASIP Journal on Audio, Speech, and Music Processing
Year: 2023
ISSN: 1687-4722, 1687-4714
DOI: https://doi.org/10.1186/s13636-023-00293-8